options(scipen=666)
library(data.table)
library(ggplot2)
You can use fread to read csv, txt, or most other delimited file formats. In general fread is smart enough to detect how the data is delimited, so you don't need to pass a sep="," argument.
file_name = "prices_nasdaq.csv"
dt = fread(file_name)
dt
?fread has the following intro: "Similar to read.table but faster and more convenient. All controls such as sep, colClasses and nrows are automatically detected. bit64::integer64 types are also detected and read directly without needing to read as character before converting." In general fread will just work 95% of the time, but in some cases you need to pass some parameters explicitly.
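A minimal sketch of those parameters, on a tiny CSV written to a temp file so the example is self-contained (the column names mirror the NASDAQ file, but the data here is made up):

```r
library(data.table)

# A toy CSV written to a temp file so the example is self-contained.
tmp <- tempfile(fileext = ".csv")
writeLines(c("date,ticker,close",
             "2018-04-26,LASR,27.0",
             "2018-04-27,LASR,25.0"), tmp)

# The defaults usually just work:
dt_auto <- fread(tmp)

# ...but you can be explicit when auto-detection is not enough:
dt_explicit <- fread(tmp,
                     sep = ",",                          # force the delimiter
                     select = c("date", "close"),        # read only some columns
                     colClasses = list(character = "date"))  # force a column class
dt_explicit
```

Other parameters I reach for occasionally are nrows (read only the first n rows, useful to peek at a huge file) and na.strings.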
dt[1:5]
In this case I simply construct a boolean vector where ticker matches a string. Internally this is a vector with TRUE for the rows I want to keep and FALSE for the rows that don't satisfy the condition.
dt[ticker == 'MELI']
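You can see that logical vector directly by evaluating the expression in j instead of i. A toy table (not the NASDAQ data) makes this concrete:

```r
library(data.table)

toy <- data.table(ticker = c("MELI", "LASR", "MELI"),
                  close  = c(300, 27, 310))

# The same expression in j returns the underlying logical vector:
toy[, ticker == "MELI"]

# Used in i, only the TRUE rows are kept:
toy[ticker == "MELI"]
```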
If we call str on the data.table we get a list of the columns and their classes (data types). Here we can see date was read as character, and the format seems to be year-month-day. The rest of the variables look correct to me.
str(dt)
## Classes 'data.table' and 'data.frame': 7880855 obs. of 8 variables:
## $ date : chr "2018-04-26" "2018-04-27" "2018-04-30" "2018-05-01" ...
## $ open : num 23 26.2 25.6 25.2 27 ...
## $ high : num 28.2 26.7 26.4 27 29.8 ...
## $ low : num 22.2 24 24.9 25 26.9 ...
## $ close : num 27 25 24.9 26.9 29.1 ...
## $ volume : num 3906000 626300 229000 474500 456900 ...
## $ adjusted: num 27 25 24.9 26.9 29.1 ...
## $ ticker : chr "LASR" "LASR" "LASR" "LASR" ...
## - attr(*, ".internal.selfref")=<externalptr>
## - attr(*, "index")= atomic
## ..- attr(*, "__ticker")= int 566359 566360 566361 566362 566363 566364 566365 566366 566367 566368 ...
The idea now is to change the class (data type) of the date column. To do this we will use the assignment-by-reference operator ( := ), informally also called the walrus operator. Unlike a regular assignment, := modifies the column in place without copying the table.
dt[, date:=as.Date(date, format="%Y-%m-%d")]
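To see that := really works by reference, we can check the table's memory address before and after the update on a toy table (data.table exports address for exactly this kind of check):

```r
library(data.table)

x <- data.table(d = c("2020-01-01", "2020-01-02"))

addr_before <- address(x)   # memory address before the update
x[, d := as.Date(d)]        # converts the column in place
address(x) == addr_before   # the table was not copied
class(x$d)
```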
str(dt)
## Classes 'data.table' and 'data.frame': 7880855 obs. of 8 variables:
## $ date : Date, format: "2018-04-26" "2018-04-27" ...
## $ open : num 23 26.2 25.6 25.2 27 ...
## $ high : num 28.2 26.7 26.4 27 29.8 ...
## $ low : num 22.2 24 24.9 25 26.9 ...
## $ close : num 27 25 24.9 26.9 29.1 ...
## $ volume : num 3906000 626300 229000 474500 456900 ...
## $ adjusted: num 27 25 24.9 26.9 29.1 ...
## $ ticker : chr "LASR" "LASR" "LASR" "LASR" ...
## - attr(*, ".internal.selfref")=<externalptr>
## - attr(*, "index")= atomic
## ..- attr(*, "__ticker")= int 566359 566360 566361 566362 566363 566364 566365 566366 566367 566368 ...
Now we can see the date is of Date class.
This is probably one of the features I use the most, as it's often needed to understand the data. The .N symbol holds the number of rows in a group (in this case ticker); you can read the docs with ?".N". Using .N in j creates a column called N.
dt[, .N, ticker]
We can easily sort this result in decreasing order by using the - sign before the N column that .N generates. I often chain this operation with another [] call.
dt[, .N, ticker][order(-N)]
Sometimes it's useful to get more observations printed! We can chain the result AGAIN, which gives a one-liner. With the following command I get the top 10 groups.
dt[, .N, ticker][order(-N)][1:10]
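The chaining works because each [] returns a new data.table, which the next [] operates on. On a toy table (made-up tickers) each step looks like this:

```r
library(data.table)

toy <- data.table(ticker = c("A", "A", "B", "B", "B", "C"))

toy[, .N, by = ticker]                  # counts per group
toy[, .N, by = ticker][order(-N)]       # chained: sort by N, descending
toy[, .N, by = ticker][order(-N)][1:2]  # chained again: top 2 groups
```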
It's often useful to save these results and plot a histogram, in this case to understand how many days of data are available for each ticker.
qdays = dt[, .N, ticker][order(-N)]
qdays[1:2]
The following plot shows the distribution of the number of days available per ticker. The spread should at least make us wary that the data might have issues. But since these are companies/ETFs that may have listed at different points in time, we can't conclude from this alone whether the data has issues or not.
ggplot(qdays, aes(x=N)) +
geom_histogram() +
xlab("Amount of days available per ticker")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
data.table provides the shift function for this. We first need to sort the data.table, take the lag of the adjusted close price, and then compute the percent difference to the previous value.
There are multiple ways to sort a DT. For simplicity I often use order, but there are other options such as setkey or setkeyv.
dt = dt[order(ticker, date)]
dt
We sort the table in ascending order as this is how the data gets generated.
dt[ticker == "MELI"]
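For comparison, setkey sorts by reference (no sorted copy is created) and additionally marks the columns as the table's key, which enables fast keyed joins and lookups. A toy example:

```r
library(data.table)

toy <- data.table(ticker   = c("B", "A", "A"),
                  date     = as.Date(c("2020-01-02", "2020-01-02", "2020-01-01")),
                  adjusted = c(10, 5, 4))

# Sorts toy in place by ticker, then date, and records the key:
setkey(toy, ticker, date)
toy
key(toy)
```

setkeyv is the same operation but takes the column names as a character vector, which is handy when they are computed programmatically.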
I pass the adjusted variable to shift, n=1 and group by ticker. The idea is to create a lag for each ticker group.
dt[, lag:=shift(adjusted, n=1), by=ticker]
Now the lag column looks like this for the “MELI” ticker, which is BTW a company similar to Amazon.
dt[ticker == "MELI"]
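A key detail is that shift combined with by restarts at each group boundary, so the first row of every ticker gets NA rather than the previous ticker's last price. On a toy table with made-up prices:

```r
library(data.table)

toy <- data.table(ticker   = rep(c("A", "B"), each = 3),
                  adjusted = c(10, 11, 12, 100, 110, 121))

# Lag of adjusted within each ticker; the first row of each
# group has no previous value, so it becomes NA.
toy[, lag := shift(adjusted, n = 1), by = ticker]
toy
```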
I can now compute the daily return of each asset. This shows how the stock is moving each day and will allow some calculations to be done in the next section.
dt[, R:=ifelse(lag == 0, 0, adjusted / lag - 1)]
I'll remove the rows with NA values (the first observation of each ticker) for simplicity.
dt = dt[complete.cases(dt)]
We can start with a top-down approach and understand what is the return distribution of all available assets.
dt[, quantile(R, c(0.01, 0.25, 0.5, 0.75, 0.99))]
## 1% 25% 50% 75% 99%
## -0.11111111 -0.01242236 0.00000000 0.01203080 0.13249998
It's clear the median daily return across all assets is zero, and this makes sense for this kind of data.
This is probably the second feature of DT I use the most. Computing a function by a group. In this case I’m grouping by ticker and applying a set of functions to the R variable. This returns a new object I called ret_agg where each new column is the result of the group-by operation. Therefore the result R_median variable will have the median of each ticker.
ret_agg = dt[, .(R_median = median(R),
R_mean = mean(R),
R_sd = sd(R),
R_mad = mad(R)),
by=ticker]
ret_agg
It seems there are quite a few outliers in the data. The median seems like a better metric to get a broad idea of the return of an asset.
ggplot(ret_agg, aes(R_mean)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(ret_agg, aes(R_median)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(ret_agg, aes(R_mad)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
By using mad (median absolute deviation) and the median we get a better idea of the market returns at the ticker level compared to using the mean or sd.
ggplot(ret_agg, aes(R_mad, R_median)) +
geom_point() +
ggtitle("Relation between MAD and median for each ticker")
The above is for demo purposes, but in practice I believe using simple functions helps code readability.
To summarize some of the above operations I would create two functions:
# reading the data with fread and changing the date class to Date
read_data <- function(file_name){
dt = fread(file_name)
dt[, date:=as.Date(date, format="%Y-%m-%d")]
dt
}
compute_returns <- function(dt){
dt = dt[order(ticker, date)] # sorting
dt[, lag:=shift(adjusted, n=1), by=ticker] # create a lag
dt[, R:=ifelse(lag == 0, 0, adjusted / lag - 1)] # compute the return
dt[complete.cases(dt)] # remove rows with NAs
}
The following code looks much clearer to follow to me. I first read the data:
dt = read_data("prices_nasdaq.csv")
Then I compute the returns:
dt = compute_returns(dt)